Abstract

Background: The aim of the present study was to investigate the usability of verbal rating scale anchors in patients 
suffering from a depressive episode and whether differences between frequency or intensity scales could be 
determined. Frequency and intensity terms were evaluated concerning their interindividual congruency, 
intraindividual stability across time, and distinguishability of adjacent terms. 

Methods: In a longitudinal design, 44 patients (age M=39.1, SD=15.2, 68.2% female) with a depressive disorder 
completed several established questionnaires (e.g. BDI or SCL-90) as well as questionnaires containing frequency and 
intensity terms. For each term, patients indicated the percentage of time or intensity that it reflects, at 
two different measuring times within one week. Data analysis comprised t-tests for paired samples and effect sizes 
d according to Cohen. 

Results: Intensity terms showed weaker intraindividual stability across time as compared to frequency terms. 
Participants were able to reliably distinguish four frequency and intensity terms at both measuring times. Overall 
congruency between patients was larger for intensity terms in comparison to frequency terms. 

Conclusions: The present results indicate that both frequency and intensity terms can be applied as verbal anchors 
for clinical self-report scales. However, if longitudinal assessment is intended, our results indicate that frequency 
terms should be used as they showed slightly greater stability across time. Generally, the present study suggests 
that no more than four different verbal anchors should be used together in rating scales as especially older patients 
and those with low lexical experience would not be able to reasonably differentiate more than these. 



Background 

The gold standard for diagnosing a mental disorder is a
structured diagnostic interview that examines whether
the criteria for the respective disorder are fulfilled
according to the ICD-10 or DSM-IV [1,2]. In addition,
numerous standardized instruments are available to assess
the severity of a mental disorder once a diagnosis has
been assigned, mainly self-rating scales (questionnaires)
or scales to be filled in by the diagnostician.
These questionnaires usually contain a set of statements
referring to symptoms associated with the respective
disorder together with a rating scale. Respondents are
requested to mark on the scale how intensely they
experience the symptom or how frequently the symptom
occurred in a defined time period. Most rating scales
applied in questionnaires differ with regard to number of
categories, point of origin and labeling. Some scales are
labeled exclusively by numbers (e.g., from 1 to 5), others
are labeled verbally (e.g., from "never" to "always"), and
some use both. Over the last years a considerable amount of
research has emerged that addresses the effect of several
rating scale attributes on the responses of the test taker
(for an overview see e.g., [3]).

For example, scales which are labeled by only positive
numbers (e.g., from 0 to 6) elicit different answers than
scales with negative and positive numbers (e.g., from -3
to +3). Furthermore, attributes like the number of
response options, polarity or whether a rating scale
includes "0" as response option influence the response
behavior of test takers [4-8]. However, despite extensive
research on attributes of rating scales and their
impact on response behavior, the question how to
choose verbal anchors for clinical rating scales has
mostly been disregarded, and the question whether
clinical self-report instruments should rather ask for the
frequency or intensity of symptoms has not been
investigated empirically so far.

In clinical examination both intensity and frequency of 
a symptom are important and should be accounted for 
by the treating therapist. In addition, the relevant diag- 
nostic criteria as described in ICD-10 or DSM-IV usually 
include both the frequency and intensity of symptoms. 
However, when designing a self-report instrument diag- 
nosticians have to decide which of the two dimensions 
to consider more important to assess. Both DSM-IV and 
ICD-10 offer no clear advice which of the two options 
should be chosen. 

A couple of earlier studies from the area of research 
on medical education addressed this issue from a differ- 
ent perspective. Case [9] asked members of test commit- 
tees who write questions for medical examinations to 
indicate the percentage of time or intensity that was 
reflected by imprecise terms of frequency (e.g. usually) 
which are commonly used in multiple choice questions. 
Contrary to the assumption that medical professionals share
a common definition of the phrases used, the results
showed virtually no congruence between the professionals'
ratings.

Other studies revealed that even terms like "never" or 
"always" which were expected to be rated as absolute 
"0%" or "100%" both were indicated with a range up to 
20% [10]. 

In a study about the measurement of fatigue Chang 
et al. [11] analyzed data from 161 patients (cancer, stroke 
and HIV) on two 5-point symptom self-report rating 
scales, one for frequency and one for intensity. Applying 
Rasch analysis they found a subtle but meaningful ad- 
vantage for frequency terms providing a fuller coverage 
of the fatigue continuum. The authors argued that their 
results could be interpreted as indicating that frequency 
scales outmatch intensity scales psychometrically. 

Taken together, despite its importance for the design 
of clinical instruments, the question whether frequency 
or intensity should be used as verbal anchors for self- 
report rating scales has only rarely been addressed so 
far. Results suggest that interindividual congruency of 
mental representations of imprecise terms on frequency 
or intensity generally appears to be low. However, if 
respondents are not able to reliably distinguish between 
terms like "seldom" and "sometimes", the question 
arises, whether it is justified to use them in a common 
scale which is often the case in clinical practice. Thus, it 
remains an open question which criteria should be ap- 
plied to decide (1) whether a rating scale should be 
scaled for intensity or frequency and (2) which terms 
should be "allowed" to be used together in a common 
scale. 

Therefore, the aim of this study was to search for an
empirical basis for criteria to decide whether frequency or
intensity scales should be used in clinical self-report
instruments. Data from patients with a depressive disorder
were acquired for this purpose. We proposed that
imprecise terms used as verbal anchors in rating
scales should at least adhere to the following basic
requirements:

1. interindividual congruency of mental representations 
of anchor terms 

2. intraindividual stability across time of mental 
representations of anchor terms 

3. distinguishability of adjacent terms 

These issues were examined for both terms on fre- 
quency and intensity. In the light of prior investigations 
we expected low interindividual congruency and intrain- 
dividual stability of mental representations. Practical 
implications for scale development and refinement as 
well as suggestions which terms should be allowed in a 
common rating scale are discussed. 

Methods 

Sample 

A total of 44 patients from a German university hospital 
and several community psychiatry clinics suffering from 
a depressive disorder according to ICD-10 as leading or 
secondary diagnosis provided data for this study. A fur- 
ther inclusion criterion was sufficient German language 
skills. Depression was chosen because it represents one 
of the most common and thus most important groups of 
mental disorders [12-14]. 

When applying statistical power analysis (e.g. using
the software G*Power 3.1.3, estimation for point biserial
correlations [15,16]) with N = 44 and α = 0.05, the
design of the present study has enough power (1-β =
0.96) to detect medium sized effects (d > 0.5). Of
course, power decreases when smaller effects are
considered. However, the present study intended to provide
first data on the question whether patients with a
depressive disorder show interindividual congruency and
intraindividual stability when judging imprecise
frequency and intensity terms and to derive suggestions
on how many and which such terms may be used in a
common scale. Rigorous criteria were considered
important for this purpose. Thus, we feel it most
important to prevent type-1 errors, i.e. the erroneous rejection
of the H0, while accepting a slightly heightened type-2
error.

The mean age of the sample was M=39.1 years 
(SD=15.2) with a range from 17 to 78. Fourteen male 
and 30 female patients (68.2%) participated in the study. 
For sample details see Table 1. The study was approved 
by the local ethics committee and performed according 
to the Declaration of Helsinki. 
Table 1 Characteristics of the study sample

                                                                    M      SD
age                                                                 39.1   15.2
BDI t1*                                                             31.2   11.0
BDI t2*                                                             26.9   10.7
SCL GSI (T value) t1*                                               68.1   10.3
SCL GSI (T value) t2*                                               70.7   9.3

                                                                    n      percentage (%)
sex
  male                                                              14     31.8
  female                                                            30     68.2
nationality
  German                                                            43     97.7
  Turkish                                                           1      2.3
first language
  German                                                            42     95.5
  Turkish                                                           1      2.3
  Slovakian                                                         1      2.3
years of education
  <10 years                                                         1      2.3
  10-13 years                                                       24     54.5
  >13 years                                                         16     36.4
diagnoses (multiple diagnoses possible)
  Depressive episode (F32.xx)                                       25     56.8
  Recurrent depressive disorder (F33.xx)                            18     40.9
  Disorder of adult personality (F6x.xx)                            9      20.5
  Other anxiety disorder (F41.xx)                                   4      9.1
  Mental disorder due to psychoactive substance use (F1x.xx)        3      6.8
  Bipolar Affective Disorder (currently depressed; F31.3x/F31.4x)   2      4.5
  Persistent affective disorder (F34.xx)                            2      4.5
  Adjustment disorder (F43.2x)                                      1      2.3
  Schizophrenia (F2x.xx)                                            1      2.3
  Agoraphobia (F40.0x)                                              1      2.3

*test interval t2-t1 = one week.

Material

Beck Depression Inventory (BDI)

The BDI [17] contains 21 items. Each item consists of
four self-referring statements (e.g. "I am sad"). Item
scores range from 0 to 3 and participants are supposed
to choose one or more statements per item that best
represent their mental state during the last week. A total
score >10 indicates mild to moderate depression and a
total score >18 moderate to severe depression.

The Symptom Checklist-90-revised (SCL-90)

The SCL-90 [18] contains 90 items that are Likert-scaled,
referring to the previous week, with a range from
0 ("not at all") to 4 ("very much"). The instrument
provides information on overall psychological distress.
Furthermore, the 90 items of the inventory constitute
three composite scores and nine symptom scales
(Somatisation, Obsessive-Compulsive, Interpersonal Sensitivity,
Depression, Anxiety, Hostility, Phobic Anxiety, Paranoid
Ideation, Psychoticism) allowing the calculation of
psychopathological profiles. The three composite scores
reflect the complete answer pattern of the respondent: the
"global severity index" (GSI) measures the overall mental
symptom burden, the "positive symptom distress index"
(PSDI) measures symptom intensity, and the "positive
symptom total" (PST) reflects the total number of the
respondent's symptoms. The raw scale and composite
scores are transformed to standardized T-scores with a
mean of 50 and a standard deviation (SD) of 10.
T-scores >60 reflect heightened mental burden. For our
analyses we only used the GSI as indicator of general
symptom burden.

German vocabulary test ("Wortschatztest", WST) 

The WST [19] contains 40 items. Each item consists of 
six words of which only one is a real word which can be 
found in a German dictionary. The participants are sup- 
posed to choose and highlight the real word. The num- 
ber of correctly chosen real words creates the raw score 
which can be linearly transformed into various standar- 
dized scores. In this study, a transformation into z- 
scores (M=0, SD=1) was performed. 
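
The linear transformations mentioned for the WST (z-scores) and the SCL-90 (T-scores) can be illustrated with a minimal sketch. The norm mean and norm SD used below are hypothetical placeholders and are not taken from any test manual; the function names are likewise illustrative.

```python
def z_score(raw, norm_mean, norm_sd):
    """Linearly transform a raw score to a z-score (M=0, SD=1)."""
    return (raw - norm_mean) / norm_sd


def t_score(raw, norm_mean, norm_sd):
    """Linearly transform a raw score to a T-score (M=50, SD=10)."""
    return 50 + 10 * z_score(raw, norm_mean, norm_sd)


# Hypothetical norms (assumed for illustration only): mean 25, SD 5.
example_z = z_score(30, norm_mean=25, norm_sd=5)
example_t = t_score(30, norm_mean=25, norm_sd=5)
```

Under these assumed norms, a raw score one norm SD above the norm mean maps to z = 1 and T = 60, i.e. the threshold above which the text speaks of heightened mental burden.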

Questionnaires about frequency and intensity 

In a first step of the construction of material for the 
measurement of mental representations of terms on fre- 
quency and intensity a pool of terms commonly used in 
rating scales of self-report instruments was compiled. 
For this purpose, established questionnaires in German 
(34 using intensity, 35 using frequency scales; a complete 
list of all screened questionnaires is available on request 
from the principal author) were screened resulting in a 
pool of fifteen terms on intensity (e.g., "very much") and 
fourteen terms on frequency (e.g., "sometimes"). Only
terms that appeared in at least 10% of the screened
questionnaires were included, which yielded a nearly
equal number of phrases for both frequency and
intensity. The threshold of 10% was also chosen to
prevent idiosyncratic phrases that appear only in single
questionnaires from being assessed in the study. Patients
were asked to indicate the percentage of time or inten- 
sity that is reflected by each term. 
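
The 10% inclusion rule described above can be sketched as follows. This is only an illustration under stated assumptions: the questionnaire contents are invented, and the function name `frequent_terms` is hypothetical rather than part of the study's procedure.

```python
from collections import Counter


def frequent_terms(questionnaires, min_share=0.10):
    """Return anchor terms that appear in at least `min_share` of the questionnaires."""
    counts = Counter()
    for anchors in questionnaires:
        counts.update(set(anchors))  # count each term once per questionnaire
    total = len(questionnaires)
    return sorted(term for term, n in counts.items() if n >= min_share * total)


# Invented example: 20 screened questionnaires with their verbal anchors.
screened = (
    [["never", "sometimes", "often", "always"]] * 12
    + [["never", "rarely", "often", "always"]] * 7
    + [["never", "hardly ever", "always"]]
)
term_pool = frequent_terms(screened)
```

In this invented example "rarely" (7 of 20 questionnaires) enters the pool, whereas "hardly ever" (1 of 20, i.e. 5%) is dropped.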

Further materials 

All patients completed a demographic data sheet. Clin- 
ical data were taken from a data sheet which was filled 
out by the treating therapist. 

Procedures 

Questionnaire completion was explained and supervised
either by the principal author or the treating
psychotherapist. All patients took part voluntarily without payment
and signed an informed consent form prior to testing. The
therapists received a reward of 10€ for recruiting,
administering and returning the questionnaire packages.

Patients were required to fill in the BDI, SCL-90 and the
questionnaires about frequency and intensity twice within
an interval of one week (t1 and t2). The WST and the demographic
data sheet were administered only once at t1.



Data analysis 

Interindividual congruency of mental representations of 
anchor terms 

Congruency of mental representations of frequency
terms was compared to intensity terms by means of
t-tests for paired samples and effect sizes d between the
mean standard deviation of frequency terms (SD_freq) and
intensity terms (SD_int) and their confidence intervals
(95%). If the confidence interval for d includes zero, the
effect can be regarded as statistically nonsignificant. In
order to reduce sampling error, effect sizes were
corrected using a factor provided by Hedges and Olkin
[20]. Following Cohen [21], effect sizes .20<d<.50 were
interpreted as small, .50<d<.80 as medium, and d>.80
as large.
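
The effect size computation outlined above can be sketched as follows. The exact formulas used in the study are not given in the text, so this sketch assumes the common textbook versions: a pooled-SD standardized mean difference, the small-sample correction factor J = 1 - 3/(4df - 1), and a normal-approximation 95% confidence interval. Function names and the handling of values that fall exactly on a boundary (.20, .50, .80) are illustrative choices.

```python
import math


def hedges_g_with_ci(m1, sd1, n1, m2, sd2, n2, z=1.96):
    """Bias-corrected standardized mean difference with an approximate 95% CI."""
    df = n1 + n2 - 2
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    d = (m1 - m2) / pooled_sd
    j = 1 - 3 / (4 * df - 1)  # small-sample correction factor (assumed form)
    g = j * d
    se = math.sqrt((n1 + n2) / (n1 * n2) + g ** 2 / (2 * (n1 + n2)))
    return g, (g - z * se, g + z * se)


def cohen_label(d):
    """Classify the magnitude of an effect size following Cohen's conventions."""
    size = abs(d)
    if size > 0.80:
        return "large"
    if size > 0.50:
        return "medium"
    if size > 0.20:
        return "small"
    return "negligible"


def ci_excludes_zero(ci):
    """True if the confidence interval does not include zero."""
    return ci[0] > 0 or ci[1] < 0


# Invented summary statistics for two groups of 20 observations each.
g, ci = hedges_g_with_ci(12.0, 2.0, 20, 10.0, 2.0, 20)
```

Because the formulas are assumptions, values computed this way need not reproduce the effect sizes and intervals reported in Tables 2 to 5.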

Furthermore, we investigated whether interindividual
congruency depended on patients' age, gender,
vocabulary, depression (BDI) or overall mental
symptom burden (GSI). For this purpose, the sample
was divided by median split on the respective variable
(e.g., age) and pairwise comparisons using t-tests and
effect sizes d were conducted.
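
A median split followed by the computation of the mean standard deviation per subgroup might look like the sketch below. The patient records are invented, the assignment of values equal to the median to the lower group is an assumption (the text does not say how ties were handled), and the sample standard deviation is used without knowing whether the study used the sample or the population formula.

```python
from statistics import mean, median, stdev


def median_split(patients, variable):
    """Split patients into a lower and an upper group at the median of `variable`."""
    cut = median(p[variable] for p in patients)
    low = [p for p in patients if p[variable] <= cut]  # ties go to the lower group
    high = [p for p in patients if p[variable] > cut]
    return low, high


def mean_term_sd(patients, terms):
    """Mean across terms of the between-patient SD of the percentage ratings."""
    return mean(stdev(p["ratings"][t] for p in patients) for t in terms)


# Invented data: percentage ratings of two frequency terms by four patients.
patients = [
    {"age": 22, "ratings": {"seldom": 20, "often": 70}},
    {"age": 30, "ratings": {"seldom": 25, "often": 75}},
    {"age": 51, "ratings": {"seldom": 10, "often": 60}},
    {"age": 64, "ratings": {"seldom": 40, "often": 90}},
]
younger, older = median_split(patients, "age")
sd_younger = mean_term_sd(younger, ["seldom", "often"])
sd_older = mean_term_sd(older, ["seldom", "often"])
```

A smaller mean standard deviation corresponds to higher interindividual congruency, so in this invented example the younger group agrees more closely than the older group.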

Intraindividual stability across time of mental 
representations of anchor terms 

To determine the intraindividual stability of
mental representations of anchor terms, patients' ratings
of frequency and intensity terms at t1 and t2 were
compared using t-tests for paired samples and effect sizes d
according to Cohen. Significant t-tests and effect sizes d
>.20 whose confidence intervals do not include zero were
considered signs of intraindividual instability of mental
representations. In order to identify the phrases which show
strong intraindividual stability, it was considered important to
apply rather strict standards. Thus, even the smallest effect
combined with significance in Student's t-test was considered
an indication of instability.
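
The stability check for a single term can be sketched as below. This is a simplified illustration: the ratings are invented, the variant of d (mean difference divided by the pooled SD of both time points) is an assumption, the confidence interval of d is omitted, and the critical t value is passed in by the caller instead of being derived from a t distribution.

```python
import math
from statistics import mean, stdev


def paired_t_and_d(t1, t2):
    """Paired-samples t statistic, degrees of freedom and an effect size d."""
    diffs = [a - b for a, b in zip(t1, t2)]
    n = len(diffs)
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    pooled_sd = math.sqrt((stdev(t1) ** 2 + stdev(t2) ** 2) / 2)  # assumed variant
    d = (mean(t1) - mean(t2)) / pooled_sd
    return t_stat, n - 1, d


def is_unstable(t_stat, t_crit, d, d_min=0.20):
    """Flag a term only if the t-test is significant AND the effect exceeds d_min."""
    return abs(t_stat) > t_crit and abs(d) > d_min


# Invented percentage ratings of one term by five patients at t1 and t2.
ratings_t1 = [10, 20, 30, 40, 50]
ratings_t2 = [12, 21, 33, 41, 53]
t_stat, df, d = paired_t_and_d(ratings_t1, ratings_t2)
# 2.776 is the two-sided critical t value for df = 4 at alpha = .05.
flagged = is_unstable(t_stat, t_crit=2.776, d=d)
```

In this invented example the t-test is significant because all patients shift in the same direction, but the effect is negligible relative to the spread of the ratings, so the term is not flagged as unstable.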

Distinguishability of adjacent terms 

To assess patients' ability to distinguish adjacent terms,
effect sizes and their confidence intervals were
calculated according to the method used to assess
interindividual congruency. Following Cohen, effect sizes
.20<d<.50 were interpreted as small, .50<d<.80 as medium,
and d>.80 as large. The number of distinguishable
adjacent terms was determined separately for small, medium
and large effects between terms.
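
One possible way to count distinguishable adjacent terms is sketched below. The counting rule is an assumption, since the text does not spell it out: terms are ordered by their mean rating, and a term is kept only if its effect size relative to the last kept term exceeds the chosen threshold. The means and standard deviations are invented, and the confidence interval criterion is left out for brevity.

```python
import math


def cohens_d(m1, sd1, m2, sd2):
    """Standardized difference between two term means (equal-weight pooled SD)."""
    return (m2 - m1) / math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)


def distinguishable_terms(term_stats, d_min):
    """Greedily keep terms whose d against the previously kept term exceeds d_min."""
    ordered = sorted(term_stats.items(), key=lambda item: item[1][0])
    kept = [ordered[0]]
    for name, (m, sd) in ordered[1:]:
        _, (last_m, last_sd) = kept[-1]
        if cohens_d(last_m, last_sd, m, sd) > d_min:
            kept.append((name, (m, sd)))
    return [name for name, _ in kept]


# Invented (mean, SD) of percentage ratings for six frequency terms.
term_stats = {
    "never": (5, 8),
    "seldom": (20, 12),
    "occasionally": (33, 15),
    "sometimes": (35, 15),
    "often": (70, 14),
    "always": (95, 7),
}
large_effect_terms = distinguishable_terms(term_stats, d_min=0.80)
```

With these invented values, "occasionally" and "sometimes" cannot be separated under the strictest criterion, so five of the six terms remain.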

Results 

Interindividual congruency of mental representations 

Overall interindividual congruency was poor for both
frequency (M_SD(t1) = 20.06; M_SD(t2) = 19.90) and intensity terms
(M_SD(t1) = 18.31; M_SD(t2) = 15.68) but larger for intensity
terms in comparison to frequency terms. The difference
was significant only for t2 (t1: d=.26, CI [-.16 - .68], t=.30
(p=.77); t2: d=.78, CI [.34 - 1.21], t=2.57 (p=.02)). The
congruency of both frequency and intensity terms was
influenced by age and gender. Younger (age↓) and male
patients showed a larger congruency than older (age↑)
and female patients (Tables 2 and 3).

Intraindividual stability of mental representations 

At the single-item level most terms showed sufficient
intraindividual stability, given that neither t-tests nor
effect sizes reached statistical significance. According
to the performed t-tests, intensity terms showed
intraindividual instability for three terms (no, not all, intense)
while only one frequency term (often)
was not intraindividually stable (Figures 1 and 2).

When aggregating data, i.e. when averaging the
standard deviations of all items, intraindividual stability across
time was evident for both frequency terms and intensity
terms. However, patients reporting higher levels of
depression as indicated by a high BDI sum score at t1
(BDI t1↑) and patients reporting higher levels of general
mental distress as indicated by a high SCL GSI score at t2
(SCL t2↑) judged intensity terms more heterogeneously
at t1 than at t2 according to t-test and effect size
(t for BDI t1↑ = 2.474 (p=.027) and ES for SCL t2↑ = .992
(CI .299 - 1.644), respectively) (Tables 4 and 5).

The data on single item level for every split group can 
be found in the Additional files 1 and 2. 

Distinguishability of adjacent terms 

The distinguishability of adjacent terms was tested for 
the whole group and for each median split subgroup. 
The number of distinguishable adjacent terms was deter- 
mined for small, medium and large effects between 
terms separately. Details on this analysis can be found in 
Tables 4 and 5. In the following, main results will be 
summarized. 

In the whole sample the patients were able to
distinguish five to seven frequency terms depending on the
effect size criterion applied and five to eight intensity
terms at both time points.

WST: The patients with lower vocabulary skills (WST↓)
were able to differentiate four to six frequency terms and
five to seven intensity terms, while patients with a higher
vocabulary (WST↑) were able to distinguish six to seven
terms for both frequency and intensity at both
time points.

Age: Five to seven frequency terms and six to eight
intensity terms could be distinguished by the younger
patients (age↓) at both measuring times; the older
patients (age↑) could distinguish fewer terms (four to five
frequency terms and five to six intensity terms).

Gender: Male patients were able to distinguish nine (t1)
and six (t2) frequency terms and eight (t1) and seven (t2)
intensity terms. Female patients could only distinguish four
to six frequency and five to seven intensity terms.

BDI: In relation to the BDI sum score there was no clear
difference between the split groups. About one more term
could be distinguished for intensity than for
frequency.

SCL: The SCL split groups showed no clear difference 
in relation to the distinguishability at the two time 
points, as well as to frequency and intensity terms. 

Discussion 

The aim of the present study was to investigate the us- 
ability of verbal rating scale anchors in patients suffering 
from a depressive episode and to search for an empirical 
basis for criteria to decide whether frequency or inten- 
sity scales should be used in clinical self-report instru- 
ments. Three criteria were applied to compare the 
appropriateness of using frequency as compared to in- 
tensity terms in self-report rating scales: (1) interindivi- 
dual congruency of mental representations of terms, (2) 
intraindividual stability across time of mental represen- 
tations of terms, and (3) distinguishability of adjacent 
terms. 



Table 2 Interindividual congruency of mental representations of frequency terms

        t1                                                              t2
        M1(SD1)       M2(SD2)       t(p)          d(CI)                 M1(SD1)       M2(SD2)       t(p)          d(CI)
WST     18.61(5.54)   19.70(9.23)   -.37(.72)     -.14(-.73-.45)        21.67(6.49)   17.37(7.90)   1.52(.14)     .59(-.02-1.18)
age     14.58(5.07)   23.53(7.46)   -3.58(.00)*   -1.40(-2.03-.71)      16.81(5.34)   21.68(6.97)   -2.00(.06)    -.77(-1.39-.15)°
sex     9.44(4.68)    22.62(6.57)   -5.82(.00)*   -2.16(-2.9-1.35)°     13.33(7.24)   21.92(6.57)   -3.17(.00)*   -1.27(-1.93-.56)°
BDI t1  19.46(7.35)   19.55(4.95)   -.03(.57)     -.01(-.61-.58)        18.13(7.01)   19.51(6.55)   -.52(.61)     -.21(-.80-.40)
BDI t2  22.26(7.80)   16.42(5.77)   2.17(.04)*    .85(.20-.48)°         20.92(6.85)   17.71(6.46)   1.23(.23)     .48(-.15-1.09)
SCL t1  20.08(7.80)   15.75(6.63)   1.53(.14)     .60(-.90-1.26)        17.33(8.50)   19.63(6.89)   -.76(.46)     -.30(-.96-.37)
SCL t2  18.02(9.26)   19.97(4.01)   -.13(.74)     -.14(-.79-.52)        14.67(8.93)   21.10(7.31)   -2.01(.06)    -.79(-1.45-.19)°

M1 refers to: WST↓, age↓, male, BDI t1↓, BDI t2↓, SCL t1↓ and SCL t2↓; M2 refers to: WST↑, age↑, female, BDI t1↑, BDI t2↑, SCL t1↑ and SCL t2↑.
* p<.05.
° If the effect size d is larger than .20 and the confidence interval for d does not include zero, the effect can be regarded as statistically significant.
Table 3 Interindividual congruency of mental representations of intensity terms

        t1                                                              t2
        M1(SD1)       M2(SD2)       t(p)          d(CI)                 M1(SD1)       M2(SD2)       t(p)          d(CI)
WST     17.94(7.23)   17.63(10.33)  .09(.93)      .03(-.56-.62)         21.67(6.49)   13.90(5.67)   1.68(.11)     1.28(.60-1.91)°
age     11.86(4.55)   22.29(10.70)  -3.47(.00)*   -1.18(-1.89-.43)°     11.63(5.26)   18.08(6.67)   -2.94(.01)*   1.05(-1.73-.31)°
sex     8.13(5.47)    20.60(9.44)   -4.43(.00)*   -1.48(-2.16-.75)°     7.90(5.04)    17.69(4.87)   -5.41(.00)*   -1.99(-2.72-1.18)°
BDI t1  17.39(7.88)   18.04(8.80)   -.22(.83)     -.08(-.68-.52)        13.35(5.67)   14.82(5.40)   -.73(.47)     -.27(-.87-.35)
BDI t2  19.44(8.67)   16.41(8.58)   .96(.34)      .35(-.27-.96)         13.64(5.73)   14.77(6.01)   -.53(.60)     -.19(-.81-.43)
SCL t1  19.44(8.67)   16.41(8.58)   .96(.34)      .35(-.27-.96)         17.69(5.00)   15.45(6.74)   .74(.47)      .38(-.31-1.05)
SCL t2  19.41(10.12)  18.98(6.94)   .21(.84)      .02(-.64-.67)         16.15(5.34)   12.73(5.59)   .82(.42)      .63(-.60-1.28)

M1 refers to: WST↓, age↓, male, BDI t1↓, BDI t2↓, SCL t1↓ and SCL t2↓; M2 refers to: WST↑, age↑, female, BDI t1↑, BDI t2↑, SCL t1↑ and SCL t2↑.
* p<.05.
° If the effect size d is larger than .20 and the confidence interval for d does not include zero, the effect can be regarded as statistically significant.



All in all, the reported results do not give a clear pic- 
ture on whether frequency or intensity terms should be 
preferred as verbal anchors in rating scales. Intensity 
terms showed a larger congruency than frequency terms, 
but both congruencies were rather low. The 
congruency of both frequency and intensity terms was 
influenced by age and gender. Male patients and 
younger patients seem to show a higher agreement in 
comparison to female and older (>38 years) patients in 
regard to this criterion. However, the majority of patients 
with a depression are female and older than 50 years [13] 
and this group showed particularly low congruency when 
evaluating imprecise terms in the present study. This 
should be kept in mind when using self-report instru- 
ments and should encourage clinicians not to rely on 
questionnaires alone but rather apply structured diagnos- 
tic interviews for diagnostic purposes more frequently, 
especially in this patient population. When developing 
new questionnaires diagnosticians might want to con- 
sider applying those terms that showed reasonable con- 
gruency for older and female patients, too. 



Figure 1 Intraindividual stability of mental representations of
frequency terms (annotation in the figure: t = 2.55, p = .014).



In comparison to frequency terms, intensity terms
showed a higher number of intraindividually unstable
terms (three vs. one), and instability was additionally
influenced by two of the examined additional variables
(depression, overall mental symptom burden). This can
be interpreted as indicating that participants differed
more in their mental representations of intensity terms
than of frequency terms and that severity of mental
symptoms (especially depression) inflated these
differences more clearly for intensity than for frequency
terms. So, concerning intraindividual stability, frequency
terms appear to be slightly superior to intensity terms.

Concerning the distinguishability of adjacent terms, no
clear general advantage for either frequency or
intensity terms could be determined. Assessing the
distinguishability for the overall group there seemed to be a
slight advantage for intensity terms. Considering only 
the strictest criterion of d>.80 there was no difference: 
for both frequency and intensity terms five terms could 
be distinguished. The distinguishability of both fre- 
quency and intensity terms seemed to be influenced by 
age and gender as well as lexical experience. Again, par- 
ticularly older and female patients and those with low 
lexical experience showed poorer ability to discriminate 
Figure 2 Intraindividual stability of mental representations of
intensity terms.
Table 4 Distinguishability of adjacent terms and intraindividual stability of frequency terms

         distinguishable terms                                       intraindividual stability
         n (t1)                        n (t2)
         .20<d<.50  .50<d<.80  d>.80   .20<d<.50  .50<d<.80  d>.80   M1(SD1)       M2(SD2)       t(p)         d(CI)
all      7          6          5       6          6          5       20.06(5.10)   19.90(6.34)   .20(.85)     .03(-.39-.45)
WST↓     6          6          5       4          4          4       18.61(5.54)   21.67(6.49)   -1.43(.18)   -.51(-1.11-.12)
WST↑     7          7          7       6          6          6       19.70(9.23)   17.37(7.90)   1.50(.16)    .27(-.31-.85)
age↓     7          7          6       6          6          5       14.58(5.07)   16.81(5.34)   -1.83(.09)   -.43(-1.03-.19)
age↑     5          5          4       4          4          4       23.53(7.46)   21.68(6.97)   .71(.49)     .26(-.34-.85)
male     9          9          9       6          6          6       9.44(4.86)    13.33(7.24)   -1.39(.86)   -.63(-1.37-.15)
female   6          6          5       6          6          4       22.62(6.57)   21.92(6.57)   .81(.43)     .11(-.40-.61)
BDI t1↓  6          6          5       6          6          5       19.46(7.35)   18.13(7.01)   .81(.43)     .19(-.42-.79)
BDI t1↑  6          6          5       5          5          5       19.55(4.95)   19.51(6.55)   .02(.98)     .01(-.59-.60)
BDI t2↓  4          4          4       6          6          5       22.26(7.80)   20.92(6.85)   .73(.48)     .18(-.14-.80)
BDI t2↑  6          6          6       6          6          5       16.42(5.77)   17.71(6.46)   -.79(.45)    -.21(-.81-.40)
SCL t1↓  6          6          5       6          6          6       20.08(7.80)   17.33(8.50)   1.92(.08)    .34(-.35-1.01)
SCL t1↑  7          7          7       5          5          4       15.75(6.63)   19.63(6.89)   -1.40(.19)   -.58(-1.23-.10)
SCL t2↓  5          5          5       7          7          6       18.02(9.26)   14.67(8.93)   2.09(.06)    .37(-.32-1.04)
SCL t2↑  6          6          5       5          5          4       18.97(4.02)   21.10(7.31)   -1.04(.32)   -.36(-1.0-.29)

° If the effect size d is larger than .20 and the confidence interval for d does not include zero, the effect can be regarded as statistically significant.
* p<.05.



between adjacent terms. Only four terms could be
distinguished by these subgroups. So, these results suggest
that rating scales in newly developed questionnaires that
are intended to be applied in patients suffering from a
depressive disorder should be limited to not more than
four different verbal anchors.

The results of this study are generally consistent with the 
findings of prior research. Chang et al. [11] examined 



Table 5 Distinguishability of adjacent terms and intraindividual stability of intensity terms

         distinguishable terms                                       intraindividual stability
         n (t1)                        n (t2)
         .20<d<.50  .50<d<.80  d>.80   .20<d<.50  .50<d<.80  d>.80   M1(SD1)       M2(SD2)       t(p)         d(CI)
all      8          8          5       7          7          5       15.67(4.24)   18.31(7.99)   1.80(.09)    .41(-.02-.83)
WST↓     5          5          5       7          7          5       16.93(4.12)   17.94(7.23)   .76(.45)     .17(-.44-.77)
WST↑     6          6          6       7          7          6       13.90(5.67)   17.63(10.33)  1.81(.09)    .45(-.15-1.03)
age↓     8          8          7       6          6          6       11.63(5.26)   11.86(4.55)   .16(.87)     .05(-.56-1.07)
age↑     5          5          5       6          6          5       18.08(6.67)   22.29(10.71)  1.36(.20)    .47(-.15-1.07)
male     8          8          8       7          7          7       7.90(5.04)    8.13(5.47)    .18(.19)     .04(-.17-.80)
female   7          7          5       7          7          5       17.69(4.87)   20.60(9.44)   1.78(.10)    .39(-.13-.89)
BDI t1↓  5          5          5       7          7          5       13.35(5.67)   17.39(7.88)   1.32(.21)    .59(-.41-1.19)
BDI t1↑  7          7          4       5          5          4       14.82(5.40)   18.04(8.80)   2.47(.03)*   .44(-.17-1.04)
BDI t2↓  5          5          5       6          6          5       20.92(6.85)   22.26(7.80)   1.79(.10)    .18(-.44-.80)
BDI t2↑  8          8          5       6          6          5       14.77(6.01)   16.41(8.58)   .58(.57)     .22(-.40-.83)
SCL t1↓  5          5          5       6          6          6       17.69(5.00)   19.12(8.09)   1.05(.31)    .21(-.47-.88)
SCL t1↑  6          6          5       5          5          5       15.45(6.74)   17.98(8.48)   1.59(.14)    .33(-.35-.99)
SCL t2↓  5          5          5       6          6          5       16.15(5.34)   19.14(10.12)  1.80(.09)    .37(-.32-1.04)
SCL t2↑  6          6          5       7          7          6       12.73(5.59)   18.98(6.94)   1.74(.10)    .99(.30-1.64)°

° If the effect size d is larger than .20 and the confidence interval for d does not include zero, the effect can be regarded as statistically significant.
* p<.05.
the evaluation of frequency terms of chronic fatigue 
patients using Rasch analysis and found a subtle but 
meaningful advantage for frequency terms providing a 
fuller coverage of the fatigue continuum. They argued 
that their results could be interpreted as indicating that 
frequency scales outmatch intensity scales psychometric- 
ally. We also found a slight advantage for frequency 
terms in regard to intraindividual stability, adding fur- 
ther evidence to the assumption that frequency scales 
might be easier to use for patients than intensity scales. 
However, it has to be kept in mind that we found
virtually no differences with regard to distinguishability.

Clinically, there is a difference between depression as
an affective disorder with its symptoms and fatigue as a
symptom accompanying certain medical diseases such as
HIV or cancer, as assessed by Chang et al. [11]. However,
to measure fatigue in the course of chronic illness,
Chang and colleagues [11] used items similar to those used
for measuring depressive symptoms, e.g., questions about
trouble starting activities, tiredness, fatigue, or the
ability to do usual activities. Thus, in this respect our
results may be deemed comparable to those reported by
Chang et al. [11]. Nevertheless, additional research is
needed to investigate whether the results generalize across
different mental disorders and apply equally to large-scale
population-based samples.

Case's findings [9] from research on medical education
showed that there is only poor congruence between medical
professionals' ratings of the frequency terms used in
medical multiple-choice examinations. The present study
suggests that this result also applies to clinical
applications, in that there is likewise only poor
congruency between patients with a depressive disorder
regarding frequency and intensity terms.

In a study by Holsgrove and Elzubeir [10], terms like
"never" or "always", which were expected to be rated as
absolute "0%" or "100%", were indicated with a range of up
to 20%. Similarly, in the patients' assessments in the
present study, "never" was rated with a mean of 16% and
"always" with a mean of 90%.

Some potential limitations should be considered when
interpreting the results reported in this study. The
sample size (n=44) was not large, so the stability of the
applied statistics might be regarded as limited. However,
we applied a longitudinal design, which improves
statistical power for all comparisons concerning
intraindividual stability, and power analysis indicated
that the design of the present study was not underpowered
to detect the effects we were interested in. To ensure
homogeneity of the recruited patient sample, and because
depression represents one of the most common and thus
important groups of mental disorders [12-14], only patients
with depression as leading or secondary diagnosis were
included, and their ratings were not compared with or
adjusted to the results of a control group. Therefore,
the degree to which the present results can be generalized
to patients with other mental disorders or to people
without mental disorders might be limited, and additional
research is needed to investigate whether our results
apply to other patients who are frequently subject to
self-report assessments (e.g., patients with anxiety
disorders) or to large-scale population-based
investigations in which questionnaires might be applied as
screeners for mental disorders.

While it can be assumed that the percentage scaling
used in the present study is intuitively understandable
when judging frequency terms, this might not apply to
intensity terms to the same extent. So it could be possible
that patients had more difficulties indicating intensity
terms by ranking them in their personal range of
understanding. However, since the results do not indicate
that frequency or intensity terms can be deemed superior to
the other regarding all three criteria evaluated in the
present study, but rather show a mixed picture, there is no
evidence that this potential bias might have affected the
results systematically.

There was no supervision of the patients while they
were filling out the material, so it cannot be ruled out
that some patients might have had problems grasping all
instructions. However, the treating therapists who
handed over the questionnaire package were explicitly
advised to explain the assignment in detail, and according
to the therapists' feedback all patients reported that
they understood all instructions.

The study was carried out in German, and the terms used
were all extracted from commonly used self-report
instruments that were developed in or translated into
German. Therefore, the terms in the present study might
not all correspond exactly between English and German,
and the reader should keep in mind that some terms which
seem synonymous in English are not synonymous in German.
Despite this limitation, it has to be noted that in the
study by Case [9], individuals without mental disorders
showed poor congruency similar to what we found in
our data, although Case's study was carried out in
English. So, given the limited previous research on this
issue, it can tentatively be assumed that those findings
could be reproduced in different languages.

To sum up, the reported results suggest that frequency
terms have a slight advantage over intensity terms with
regard to intraindividual stability of mental
representations, while both groups of terms exhibited low
interindividual congruency. Furthermore, from a
psychometric perspective, patients differed in their
ability to distinguish between different frequency terms
and between different intensity terms. If a given rating
scale is intended to be applicable to all patients with a
depressive disorder independently of further patient
characteristics (i.e., including older patients and those
with low lexical experience), then no more than four
different verbal anchors should be used.

Conclusions 

The present results do not support a clear recommendation
on whether to choose frequency or intensity terms
as verbal anchors of self-report rating scales in clinical
applications. There is some preliminary evidence that
frequency terms might have a slight advantage over
intensity terms with regard to intraindividual stability
across time, so it might be advisable to use frequency
terms when designing a self-report instrument that is
intended to be applied in longitudinal assessments.
Moreover, the present study suggests that no more than
four different verbal anchors should be used together in
rating scales, as patients with a depressive disorder would
not be able to reasonably differentiate more than these
four. Generally, the results indicate that mental
representations of imprecise terms for frequency or
intensity can differ depending on patient characteristics
(e.g., age, gender, mental symptom burden, lexical
experience). Scale developers should account for this issue
and carefully deliberate about which and how many terms to
use in a rating scale. Further research should investigate
to what extent these results generalize to patients
with other mental disorders.